Homework #6

Due 11:59 pm EST, Friday April 8th, 2022.

Email your solutions (both .ipynb and .html files) to: compscbio@gmail.com.

Background:

A wise, less-sadistic post-doc in your collaborator's lab has generated scRNAseq data from HSPCs exactly as was done in the Weinreb et al. 2020 paper (in fact, it is the same data). She has asked you to analyze it using CoSpar to address the two questions listed below.

The data

  1. scRNAseq data of hematopoietic stem and progenitor cells with lineage barcodes. This is the raw counts data, as well as the lineage barcodes, as we discussed in the lineage tracing lectures.

  2. There is no second data set for this homework.

Your mission:

Analyze the data to answer the following questions:

1.) What genes distinguish undifferentiated cells biased towards erythrocytes versus megakaryocytes?

2.) What genes distinguish undifferentiated cells biased towards erythrocytes AND megakaryocytes versus those undifferentiated cells biased towards monocytes AND neutrophils?

Bonus (i.e. extra credit)

3.) What genes distinguish the most multipotent cells from more fate bound but still undifferentiated cells?

4.) Are any signaling pathways enriched in any of the differential expression analyses that you performed in #1-#3? The underlying hypothesis here is that fate biases result from exposure to different signaling milieus. You might explore this using GSEApy or sc.tl.score_genes(). Here is a list of signaling pathway targets, which might be helpful. It was derived as described in Emily Su's paper. You can load this dict object with the following code:

from joblib import dump, load
sigPathTargets = load("signaling_pathway_targets_040122.joblib")
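One simple way to ask whether a pathway's targets are over-represented among differentially expressed genes is a hypergeometric test. The sketch below is a stand-in for, not a reproduction of, what GSEApy or sc.tl.score_genes() would do. It assumes the loaded dict maps each pathway name to a list of target gene names, which the handout does not state; the pathway names, gene names, and helper functions are all hypothetical.

```python
from math import comb

def hypergeom_sf(k, M, n, N):
    """P(X >= k) when drawing N genes from M, of which n are pathway targets."""
    upper = min(n, N)
    total = comb(M, N)
    return sum(comb(n, i) * comb(M - n, N - i) for i in range(k, upper + 1)) / total

def pathway_enrichment(de_genes, pathway_targets, background):
    """Return {pathway: (overlap, p_value)} for each pathway in the dict."""
    background = set(background)
    de = set(de_genes) & background
    results = {}
    for pathway, targets in pathway_targets.items():
        targets = set(targets) & background
        overlap = len(de & targets)
        p = hypergeom_sf(overlap, len(background), len(targets), len(de))
        results[pathway] = (overlap, p)
    return results

# Toy stand-in for sigPathTargets (assumed structure: pathway -> target genes)
toy_targets = {"pathwayA": ["g1", "g2", "g3"], "pathwayB": ["g8", "g9"]}
toy_background = [f"g{i}" for i in range(1, 21)]  # 20 measured genes
toy_de = ["g1", "g2", "g3", "g4"]                 # hypothetical DE genes

enrichment = pathway_enrichment(toy_de, toy_targets, toy_background)
for name, (overlap, p) in sorted(enrichment.items(), key=lambda kv: kv[1][1]):
    print(f"{name}: overlap={overlap}, p={p:.4g}")
```

With real data, the background would be every gene tested in the differential expression analysis, and the p-values would still need multiple-testing correction across pathways.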

Important notes:

1.) The first time that you run a CoSpar analysis, run the cs.hf.set_up_folders() function. This will set up some directories that CoSpar assumes are in place.

2.) To answer these questions, you will need to identify the undifferentiated cells that are likely to transition to either of these lineages. Please look at the updated Jupyter Notebook for CoSpar analysis, as it contains some code that we did not cover in class to specifically isolate fate-biased progenitors.

3.) You may need to adjust some parameters, such as sum_fate_prob_thresh in tl.fate_bias(). The same is true when defining differentially expressed genes.

4.) The CoSpar documentation might be helpful if you get stuck or want to dig deeper.

The fate maps show the likelihoods of the progenitor cells in red becoming the cell type in teal. The darker the red, the more likely that progenitor will become the cell type in teal.

Changing sum_fate_prob_thresh from 0.2 to 0.1 gives us more progenitors. With a threshold of 0.2 we were only getting about 25 for eryth and 0 for meg. I chose the new threshold by trying several different values, and I think 0.1 gives us the "best-looking" graphs because the progenitors sit near the cell type of interest.
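To see why lowering the threshold recovers more progenitors, here is a minimal pure-Python sketch of the selection logic. It assumes, based on my reading of the CoSpar documentation, that a cell is kept only when P(A) + P(B) exceeds sum_fate_prob_thresh and that its bias is P(A) / (P(A) + P(B)). The function name, the bias cutoffs, the fate labels, and the probabilities are made up for illustration and are not CoSpar's implementation.

```python
def select_biased_progenitors(fate_probs, fate_a, fate_b,
                              sum_fate_prob_thresh=0.2,
                              bias_thresh_a=0.5, bias_thresh_b=0.5):
    """Split cells into A-biased and B-biased lists.

    fate_probs maps cell id -> {fate name: probability}.
    Cells with P(A) + P(B) at or below the threshold are skipped.
    """
    biased_a, biased_b = [], []
    for cell, probs in fate_probs.items():
        p_a = probs.get(fate_a, 0.0)
        p_b = probs.get(fate_b, 0.0)
        total = p_a + p_b
        if total == 0 or total <= sum_fate_prob_thresh:
            continue  # too little probability mass toward either fate
        bias = p_a / total
        if bias > bias_thresh_a:
            biased_a.append(cell)
        elif bias < bias_thresh_b:
            biased_b.append(cell)
    return biased_a, biased_b

# Hypothetical fate probabilities for four undifferentiated cells
cells = {
    "c1": {"Erythroid": 0.30, "Meg": 0.02},
    "c2": {"Erythroid": 0.12, "Meg": 0.03},
    "c3": {"Erythroid": 0.02, "Meg": 0.13},
    "c4": {"Erythroid": 0.01, "Meg": 0.02},
}

strict = select_biased_progenitors(cells, "Erythroid", "Meg", sum_fate_prob_thresh=0.2)
loose = select_biased_progenitors(cells, "Erythroid", "Meg", sum_fate_prob_thresh=0.1)
print("thresh 0.2:", strict)
print("thresh 0.1:", loose)
```

In this toy case the stricter threshold leaves no Meg-biased cells at all, which is the same kind of behavior described above for the real data.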

I originally set the FDR cutoff to 0.95 since we don't have that many to work with and I wanted to see all of them. I then settled on 0.5, which is already very high, because I still wanted to see more genes. Then I decided to plot the two lowest q-value genes for each DGE group because group B only has 2 and group b has 2 q-values under 0.2. Although the q-values are a little high, we can still see the separation between the two cell types pretty well.
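For reference, a q-value cutoff works like this: p-values are adjusted for multiple testing, and only genes whose adjusted value falls under the cutoff are kept. The sketch below uses the Benjamini-Hochberg adjustment on made-up p-values. It is an assumption that the q-values in the analysis were computed this way, and the gene names are placeholders.

```python
def benjamini_hochberg(p_values):
    """Return BH-adjusted q-values in the same order as the input p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q_values = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        q_values[idx] = running_min
    return q_values

# Hypothetical p-values for five genes
toy_p = {"geneA": 0.001, "geneB": 0.02, "geneC": 0.09, "geneD": 0.30, "geneE": 0.80}
genes = list(toy_p)
q = dict(zip(genes, benjamini_hochberg([toy_p[g] for g in genes])))

# How many genes survive at increasingly permissive cutoffs
kept_by_cutoff = {}
for cutoff in (0.1, 0.2, 0.5, 0.95):
    kept_by_cutoff[cutoff] = [g for g in genes if q[g] < cutoff]
    print(cutoff, kept_by_cutoff[cutoff])
```

Raising the cutoff only adds genes with weaker statistical support, so genes that appear only at 0.5 or above are best treated as candidates rather than findings.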

Q2

2.) What genes distinguish undifferentiated cells biased towards erythrocytes AND megakaryocytes versus those undifferentiated cells biased towards monocytes AND neutrophils?

In the above graphs, we can see a pretty decent split between the two groups with our 4 genes. The only issue is that in the F2r plot there are some cells highlighted in the mast and baso cell types (this is also the case with some other genes). I think this should be OK, though, since there are also cells highlighted in eryth and meg, and not all genes are going to be exclusive to those two cell types.

Q3

In the top 4 graphs, we can see that the cells are a lot more clumped in certain cell types, while the cells expressing the bottom 4 genes are more dispersed. The more dispersed cells are more likely to be multipotent. I used the original tl.fate_potency function because we are interested in comparing the fates of the undifferentiated cells. To select the cutoff, I played around with different "high" percents to get the best-looking graphs (one group with genes expressed in cells that are spread out, one group with genes expressed in cells focused in 1 or 2 branches).
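The "high" percent cutoff amounts to ranking cells by potency and calling the top slice multipotent. A minimal sketch of that split follows. The potency scores are invented, the function name and the high_percent parameter are hypothetical, and nothing here reproduces how tl.fate_potency computes its scores.

```python
def split_by_potency(potency, high_percent=20):
    """Label the top `high_percent` percent of cells by potency as multipotent.

    potency maps cell id -> potency score (higher = more multipotent).
    Returns (multipotent_cells, fate_bound_cells), both sorted high to low.
    """
    ranked = sorted(potency, key=potency.get, reverse=True)
    n_high = max(1, round(len(ranked) * high_percent / 100))
    return ranked[:n_high], ranked[n_high:]

# Hypothetical potency scores for five undifferentiated cells
toy_potency = {"c1": 0.9, "c2": 0.2, "c3": 0.7, "c4": 0.4, "c5": 0.1}

multi_40, bound_40 = split_by_potency(toy_potency, high_percent=40)
multi_20, bound_20 = split_by_potency(toy_potency, high_percent=20)
print("top 40%:", multi_40, "| rest:", bound_40)
print("top 20%:", multi_20, "| rest:", bound_20)
```

The two resulting groups can then be passed to a differential expression comparison; trying a few values of the percent shows how sensitive the gene list is to where the line is drawn.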

Q4